A gradient is a vector of partial derivatives with respect to each variable
eg: \[ g(x,y)=x^{2}y \] \[ {\nabla}g=\begin{bmatrix} \frac{\partial{g}}{\partial{x}} \\ \frac{\partial{g}}{\partial{y}} \end{bmatrix}=\begin{bmatrix} 2xy \\ x^{2} \end{bmatrix} \]

## 1-D Optimization (Hill Climbing)
Can sample random points \(w\) and take the \(w\) that maximizes the estimated objective
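A minimal sketch of this random-sampling search; the objective, sampling range, and sample count are illustrative assumptions:

```python
import random

random.seed(0)  # for reproducibility

def g(w):
    # Assumed example objective with its maximum at w = 2
    return -(w - 2) ** 2

# Sample random points and keep the one with the highest estimated value
candidates = [random.uniform(-10, 10) for _ in range(1000)]
best_w = max(candidates, key=g)
```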
Local hill climbing
Evaluate derivative to find the direction to step in: \[ \frac{{\partial}g(w_{0})}{{\partial}w}=\lim_{h\to0}\frac{g(w_{0}+h)-g(w_{0}-h)}{2h} \]
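The symmetric-difference estimate above can be written directly; the test function \(w^{3}\) is an arbitrary choice:

```python
def numerical_derivative(g, w0, h=1e-6):
    # Symmetric difference: (g(w0 + h) - g(w0 - h)) / (2h)
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

# Example: d/dw w^3 at w0 = 2 is 3 * 2^2 = 12
d = numerical_derivative(lambda w: w ** 3, 2.0)
```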
Perform update in uphill direction for each coordinate
The steeper the slope (the larger the magnitude of the derivative), the bigger the step for that coordinate
eg given \(g(w_{1},w_{2})\) and hyperparameter for learning rate \({\alpha}\): \[ \begin{array}{c} w_{1}\leftarrow{w_{1}+{\alpha}*\frac{{\partial}g}{{\partial}w_{1}}(w_{1},w_{2})} \\ w_{2}\leftarrow{w_{2}+{\alpha}*\frac{{\partial}g}{{\partial}w_{2}}(w_{1},w_{2})} \end{array} \]
Same as \[ {\nabla}_{w}g(w)=\begin{bmatrix} \frac{{\partial}g}{{\partial}w_{1}}(w) \\ \frac{{\partial}g}{{\partial}w_{2}}(w) \end{bmatrix} \] \[ w\leftarrow{w+{\alpha}*{\nabla}_{w}g(w)} \]
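The vectorized update above can be sketched as a full gradient-ascent loop; the concrete objective \(g(w)=-(w_{1}-1)^{2}-(w_{2}+3)^{2}\), learning rate, and iteration count are assumptions for illustration:

```python
def grad_g(w):
    # Gradient of the assumed objective g(w) = -(w1-1)^2 - (w2+3)^2,
    # which has its maximum at (1, -3)
    return [-2 * (w[0] - 1), -2 * (w[1] + 3)]

alpha = 0.1        # learning rate (hyperparameter)
w = [0.0, 0.0]     # initial guess
for _ in range(200):
    # Uphill update: w <- w + alpha * grad g(w)
    w = [wi + alpha * gi for wi, gi in zip(w, grad_g(w))]
```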
Alg ∈ \(O(n)|_{n\,\coloneqq\,|\text{dimensions}|}\)
\({\alpha}\) needs to be chosen carefully (generally s.t. update changes \(w\) by 0.1-1%)
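The rule of thumb can be checked numerically; the weight and gradient values below are assumed:

```python
# Rule of thumb: the step alpha * grad should change w by roughly 0.1-1%
w = [10.0, -20.0]       # assumed current weights
grad = [4.0, -8.0]      # assumed gradient at w
alpha = 0.01

step_norm = sum((alpha * gi) ** 2 for gi in grad) ** 0.5
w_norm = sum(wi ** 2 for wi in w) ** 0.5
relative = step_norm / w_norm   # here ~0.4%, inside the 0.1-1% range
```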
Applied to maximizing the log-likelihood, i.e. \( \max_{w}ll(w)=\max_{w}g(w) \): \[ w\leftarrow{w+{\alpha}*\sum_{i}{\nabla}{\log}(\mathbb{P}(y^{(i)}|x^{(i)};w))} \]
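As a concrete sketch, for a logistic model \(\mathbb{P}(y=1|x;w)=\sigma(wx)\) on 1-D inputs, the per-example gradient of the log-likelihood is \((y-\sigma(wx))\,x\); the toy data, learning rate, and iteration count below are assumptions:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Assumed toy 1-D data: label 1 iff x > 0
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

alpha, w = 0.5, 0.0
for _ in range(500):
    # Batch gradient of the log-likelihood: sum over all examples of
    # d/dw log P(y|x;w) = (y - sigmoid(w*x)) * x
    grad = sum((y - sigmoid(w * x)) * x for x, y in data)
    w += alpha * grad   # uphill step
```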
Gradient over a small set of training examples (mini-batch) can be computed in parallel: \[ \left.w\leftarrow{w+{\alpha}*\sum_{j\in{J}}\nabla\log\mathbb{P}(y^{(j)}|x^{(j)};w)}\right|_{J\;\coloneqq\;\text{random subset of training examples}} \] \[ \max_{w}ll(w)=\max_{w}\sum_{i}\log{\mathbb{P}(y^{(i)}|x^{(i)};w)} \]

## Deep Learning and Neural Networks
Goal ≙ avoid manual feature design; allow computer to learn features too
Finally: \[ y={\phi}(w_{1}h_{1}+w_{2}h_{2})=\frac{1}{1+e^{-(w_{1}h_{1}+w_{2}h_{2})}} \]
Same as the general Multi-layer Neural Network form: \[ \left.z_{i}^{(k)}=g\left(\sum_{j}W_{i,j}^{(k-1,k)}z_{j}^{(k-1)}\right)\right\vert_{g\;\coloneqq\;\text{nonlinear activation function}} \]
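The layer recurrence above can be sketched as a plain forward pass; the layer sizes, weight values, and the choice of sigmoid for \(g\) are assumptions:

```python
import math

def g(z):
    # Nonlinear activation function (sigmoid assumed here)
    return 1 / (1 + math.exp(-z))

def forward(z, weights):
    # weights[k][i][j] plays the role of W_{i,j}^{(k-1,k)}:
    # it connects unit j of the previous layer to unit i of the next
    for W in weights:
        z = [g(sum(W[i][j] * z[j] for j in range(len(z))))
             for i in range(len(W))]
    return z

# Tiny 2-2-1 network with assumed weights
weights = [
    [[0.5, -0.3], [0.8, 0.1]],   # input layer -> hidden layer
    [[1.0, -1.0]],               # hidden layer -> output layer
]
y = forward([1.0, 2.0], weights)
```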
- Multilayer network with batches: